Ideas for a garbage collector for memory-banked systems

Posted by pulkomandy on Thu Feb 29 21:20:52 2024  •  Comments (0)  • 

Well, apparently I'm not ready to give up on Smalltalk for the Amstrad CPC yet.

The previous attempt showed that it is not possible to fit even a rather simplified Smalltalk interpreter into the main memory of the CPC. I guess this comes as a surprise to no one: 16KB are used by the display, about 8KB by the OS, and so only around 40KB are left for everything else.

It turns out, others have tried this and somewhat succeeded. Pocket Smalltalk for the Palm Pilot promises to fit a complete game (Dungeoneers) in a similar space (I have not managed to run the game and have no idea if it is good). The Palm is a bit quirky in the way it works, but also, Pocket Smalltalk abandons the idea of editing things directly in the live Smalltalk image; instead it is designed as a cross development environment, more similar to a Java virtual machine in a way. That's why I had not considered it in my porting attempts. Well, maybe I should revisit it.

There is also TerseTalk, but that seems a bit too limited for me (only 256 objects?) and also not implemented yet (the code is extremely incomplete).

On my CPC, I currently have 2MB of memory, so, surely there would not be any problems running a larger version of Smalltalk. Maybe even Smalltalk-80 could be made to work somehow. It will probably be slow. But would it even be possible?

The Z80 CPU cannot access all this memory linearly. Instead, all we get is a 16KB window that can be moved to any part of that 2MB space. This is one of the annoying things to handle with native code. Some people choose to keep their work running on the CPC 464, and not use any of this extended memory, to save the headaches. (Not attacking or ridiculing anyone here, it is a valid choice and there are other reasons for it). Anyway, I got thinking about how the abstractions of Smalltalk could make it possible to handle this memory mapping scheme transparently for the developer.

So I started to read about garbage collectors. This post is an attempt to organize these thoughts and see if we can come up with a design that will have acceptable performance. I will continue updating it as I continue thinking about it...

Garbage collection basics

First let's talk about the general idea of dynamic memory allocation. The basic need is simple: an application is running and it wants to use memory for temporary storage, and release it later when not needed anymore, so it can be reused by something else.

One way to handle this is a very simple API like the one available in C: malloc() and free(). The application tells the memory allocator when it needs a new block of memory (indicating how large the block should be), and gets a pointer to that block. It returns the block to the allocator when it does not need it anymore.

This seems as simple as it gets. Just two functions. But, it turns out, the job of the memory allocator is quite complicated. While the application keeps track of the currently allocated memory blocks, the allocator is responsible for keeping track of the ones that are not in use. This will typically be managed with a "free list" of all the blocks that are not currently allocated. This can be implemented in a quite simple way, but there can be tricky problems, for example if the application forgets to free some block, or even worse, if it tries to free the same block multiple times. And even if the application has no bugs, this approach has another limitation: as long as a block is allocated and in use, it is not possible to move it in memory to a different address. This is because the application has a direct pointer to the block, and the allocator does not know what the application is doing with it.
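
To make this a bit more concrete, here is a minimal sketch in C of what a free-list based allocator can look like (hypothetical names, not the code of any real libc). Each free block stores its own size and a pointer to the next free block, reusing the free memory itself for bookkeeping:

    #include <stddef.h>

    /* Each free block stores its size and a link to the next free block,
     * using the free memory itself as storage. */
    struct free_block {
        size_t size;              /* size of this free block, in bytes */
        struct free_block *next;  /* next block in the free list */
    };

    static struct free_block *free_list;  /* head of the free list */

    /* First-fit allocation: walk the list until a block is large enough.
     * Splitting oversized blocks and keeping a size header in front of the
     * returned pointer (so free() knows the size) are omitted here. */
    void *my_malloc(size_t size)
    {
        struct free_block **link = &free_list;
        for (struct free_block *b = free_list; b != NULL;
             link = &b->next, b = b->next) {
            if (b->size >= size) {
                *link = b->next;          /* unlink the block */
                return b;
            }
        }
        return NULL;                      /* no block large enough */
    }

Note how the allocator only knows about the free blocks; once a block is handed out, it disappears from the allocator's view until free() gives it back.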

The idea of garbage collectors is to solve some of these problems. From the API side, the problems all come from the free() function. It is hard to misuse the malloc function. At worst, the application asks for more memory than is available, and gets an error instead of a valid pointer. The free() function, however, can be misused: by calling it with addresses that were not returned by malloc, by calling it twice with the same address, by forgetting to call it, by trying to free blocks where information stored by the memory allocator has been erased, etc.

So, let's remove that function! Problem solved. This is the end of the article. Thanks for reading.

Oh wait. Of course that introduces a new problem. How is the allocator going to know when a piece of memory can be reclaimed and put back into the free list? Garbage collectors solve this problem by analyzing the content of the memory block. This, of course, requires some level of cooperation from the application. The general idea is that the garbage collector can access the data structures stored in the allocated blocks, and identify if they have references to other blocks. A block that is not referenced by anything else is unreachable by the application, and so, its memory can be reused.

The trick now is thinking about an efficient way to find such unreferenced blocks. But, since the garbage collector knows how to access the application memory and identify references in there, it may also write this memory and update the references to point to some new place. This gives us a new tool: it is now possible to move existing blocks in memory to a different place. As we will see, this opens a lot of new possibilities. The obvious one is that, after a while, with a typical malloc/free allocator, the memory becomes fragmented, and there are only very small pieces left. There is no easy way to get out of that situation. But with a garbage collector, you can. This makes garbage collection well suited to small memory systems, where fragmentation will quickly be a problem.

Establishing our invariants

I have read some texts about garbage collectors, and most of them tend to present algorithms starting with the earliest ones, introducing the new ideas gradually. This is a good approach, but I will still make one change: first I want to introduce some terms and explanations. The thing is, each of the algorithms is described in its original research paper with its own terminology, especially the earlier ones. This, for me, makes it harder to think about how these algorithms can be combined together, which part of one can be reused in another, etc. We will need this, because most of the research on garbage collectors is done with much larger machines in mind, and no banked memory (at least the ones I could easily find. I am interested if you have papers or algorithm descriptions for something using banked memory).

The abstraction I found very useful is the tricolor invariant. It formalizes a bit what we need to do. Let's have a brief look at it.

The situation is that we have a set of objects in our memory, and these objects have references (pointers) to each other. What we need to do is, starting from a root object or set of objects, determine a second set of objects that we are sure cannot be reached from the first one by following any number of references.

If the root set is correctly chosen, this will allow us to determine a set of objects that cannot be reached in any way from the application code. These objects are unused, and their memory can be reused.

So, how does a garbage collector go about doing that? In very general terms, and without giving any specific algorithm, we can establish a solution that is proven to work. This is exactly what the tricolor invariant is.

What it does is attribute a color to each object in the system. There are 3 different colors. They could have been red, green and blue, but due to the technical limitations of printing technologies at the time, most literature uses black, white and grey instead. I will stick with these to keep things simple.

That's the "tricolor" part explained. Now let's talk about the invariant. An invariant is a condition that must always be verified. The invariant is:

A black object must never have a direct reference to a white object.

At first this does not seem helpful. But it is the key to understanding all garbage collection algorithms. Let's look at our 3 sets of objects now:

  • There is a set of black objects. They may reference grey ones and other black ones, but never white ones.
  • There is a set of white objects. They may reference anything.
  • There is a set of grey objects. They may reference anything as well.

So, because of the invariant, if we want to reach an object from another, we have a constraint (which is essentially a different way to express the invariant):

All paths from a black object to a white object will at some point travel through a grey object.

Now, let's consider a special case. What if there are no grey objects in the system? In this case, because of the invariant, we are sure that there are no references from any of the black objects to the white ones.

I think you can see where we're going with this: if the "black" set is the objects (potentially) reachable by the application, and the grey set is empty, we are sure that the white set is made of unused objects, which can be deleted.

So, we can now redefine the job of the garbage collector as a three-step process:

  • Step 1: colorize all objects such that the invariant is verified
  • Step 2: modify the coloring of objects to try to reduce the number of grey objects to 0
  • Step 3: when there are no grey objects, the remaining white objects are unused and their memory can be reclaimed

We have to eliminate one edge case: the stupid garbage collector that just makes all objects white. Then the invariant is verified, there are no grey objects, and so everything can be deleted. Simple, but not very useful. So, the initial setup in step 1 must have at least one black or grey object. Usually this is called the root: the object(s) that allow us to reach all the others. In Smalltalk, that is the object references currently on the application stack.

As I have written, this does not describe any specific algorithm just yet. Garbage collectors have to decide on various things, including: how to remember the color of nodes, how to assign the initial colors, and how to process the nodes to reduce the number of grey ones.

malloc/free, again

Having established all this theory, let's look at the practical example of a typical malloc/free allocator.

Initially all memory is unreachable by the application. We can consider this a "white" memory block.

Every call to malloc takes some memory and makes it available to the application. Since the content of that memory is not yet filled by the application, it cannot at this time contain any reference to anything in the application, much less any references to any other "white" (unused) area of memory. So, we can safely mark this block as black.

free() is used to return memory to the allocator. In theory, we don't know if the application still has references to that block in any other part of its "black" memory. So, we should mark this block black or grey. But, since we have no way to know what the application is doing, we don't have any algorithm to make this block white again. So, we have to trust the application to be correct, and mark the block white.

It seems difficult to make anything simpler. But, if we look closer, we notice two things. First, it is not as simple as it looks: malloc() is given just a size, and it can't magically find a block of "white" memory where that fits. So, there will be some data structure to do that. We will see later how garbage collectors solve this. For a typical malloc/free implementation, this will need to store a pointer or size at the start of each black memory block, and also at the start of white blocks, but since these are not used by the app, it's not a problem.

The second problem is more important: it's the way the invariant is kept true. Essentially, we have delegated our job back to the application. This is what makes malloc/free an unreliable solution: the application developer probably does not want to think about memory allocation too much, and surely they will occasionally get it wrong. That's the problem we are trying to solve here. So, let's move on to better options.

The garbage collector in Little Smalltalk

Let's see how Little Smalltalk handles this. According to the comments in the source code, this is the Baker two-space algorithm. It is relatively simple.

First let's look at the way new objects are allocated. We start from the first available byte in memory. At every allocation, we increment a pointer by the size of the allocated object. That's it. Objects are allocated next to each other this way, until we reach half of the available memory. Then, the garbage collector starts working.
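
In code, such a "bump pointer" allocator is almost nothing. Here is a hedged sketch with made-up names (not the actual Little Smalltalk source):

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t heap[32768];                /* the whole heap */
    static uint8_t *alloc_ptr   = heap;        /* next free byte */
    static uint8_t *space_limit = heap + sizeof(heap) / 2;  /* end of the active half */

    void *allocate(size_t size)
    {
        if (alloc_ptr + size > space_limit)
            return NULL;        /* half of memory is used: time to run the GC */
        void *obj = alloc_ptr;
        alloc_ptr += size;      /* objects simply pile up one after the other */
        return obj;
    }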

As we know, it must perform its job in three steps.

Step 1: colorize all objects such that the invariant is verified

All objects in the garbage collected area are first considered white, except the root, which is grey.

Since there are no black objects, the invariant is verified.

Note: the root can be either the first object in the memory, or it can be stored outside of the memory in some separate storage (for example, in Smalltalk, the execution stack is the root and is not typically a dynamically allocated object).

Step 2: modify the coloring of objects to try to reduce the number of grey objects to 0

Get the next grey object.

Analyze each reference in that object. Mark all the referenced objects as grey if they are still white.

When done, mark the object black.

Continue with the next grey object.

Eventually, there will be no more grey objects and the algorithm is complete. In the worst case, this stops when all white objects have been reached and marked grey, and eventually marked black after processing.
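
Expressed as code, the loop looks something like this (a sketch with an explicit grey worklist and a made-up object layout; as we will see just below, Baker's algorithm actually encodes the colors differently):

    enum color { WHITE, GREY, BLACK };

    struct object {
        enum color color;
        int ref_count;             /* how many references this object holds */
        struct object *refs[];     /* the references themselves */
    };

    #define MAX_GREY 1024
    static struct object *grey_list[MAX_GREY];  /* remembered grey objects */
    static int grey_top;

    static void mark_grey(struct object *o)
    {
        if (o != NULL && o->color == WHITE) {
            o->color = GREY;
            grey_list[grey_top++] = o;
        }
    }

    void collect_from(struct object *root)
    {
        mark_grey(root);                       /* step 1: only the root is grey */
        while (grey_top > 0) {                 /* step 2: until no grey object remains */
            struct object *o = grey_list[--grey_top];
            for (int i = 0; i < o->ref_count; i++)
                mark_grey(o->refs[i]);         /* white referenced objects become grey */
            o->color = BLACK;                  /* fully scanned: black */
        }
        /* step 3: everything still white is garbage */
    }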

So far, this is how you would obviously implement such an algorithm. The trick is in how the algorithm remembers which objects are white, grey or black.

In Baker's two space algorithm, the "black" and "grey" objects are moved to the second half of the memory, while the white objects remain in the first half. This is pretty interesting, because it means all the information needed is already stored in the object addresses. No additional data structures at all are needed. Pretty impressive. But that comes at the cost of not being able to use half of the memory. Maybe that's a bit much.

Other algorithms instead reserve 1 or 2 bits in the objects to store the current color (only 1 bit if the grey objects are remembered by some other means, for example by storing their addresses on a stack). That's likely more efficient, especially if you can find some available space for that extra bit. For example, if you have a 16-bit size field in objects, you could easily use one of these bits for the color. No one is going to fill the entire RAM with a single object, that would be stupid. So the size won't ever get that large.
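
For example, with a 16-bit size field, the highest bit can carry the color, something like this (a sketch, assuming object sizes stay well below 32KB):

    #include <stdint.h>

    #define MARK_BIT 0x8000u     /* highest bit of the 16-bit size field */

    struct header {
        uint16_t size_and_mark;  /* low 15 bits: object size, high bit: color */
    };

    static uint16_t obj_size(const struct header *h)  { return h->size_and_mark & ~MARK_BIT; }
    static int      is_marked(const struct header *h) { return (h->size_and_mark & MARK_BIT) != 0; }
    static void     set_mark(struct header *h)        { h->size_and_mark |= MARK_BIT; }
    static void     clear_mark(struct header *h)      { h->size_and_mark &= ~MARK_BIT; }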

Anyway, there is another problem with that general algorithm. We started by making almost all objects white. Since the surviving objects have to be moved to a new place in memory, that means eventually moving all "live" objects to a new place, and also updating all references to these objects in other objects. If we do this for a large memory (say, 512KB or 2MB) at once, this is going to block the program from running for quite a long time. Incremental garbage collectors solve this by making only a subset of objects white. But it is not so simple to do that, since we must make sure that the invariant is established (no black objects point to white ones).

Specificities of the Amstrad CPC

One thing we need to keep in mind is that the CPC memory is addressed in banks. This means we can only access 16K at once (well, not exactly, but let's keep this simple). Is there a garbage collector taking this into account?

There does not seem to be a lot of research on such things. Which makes sense: when garbage collectors for larger memory became a reasonable thing to do, it was at least on 16 or even 32 bit CPUs, and possibly with virtual memory, and even hardware acceleration to help the GC on some LISP machines. All entertaining reading, but not really what I need.

Note: after writing the first version of this article, I found out about this article by Lieberman and Hewitt, which is quite similar to the solution I ended up with. It turns out, there probably is more research than you think about these things, it's just not immediately obvious where to find it.

So, I have to get back to the basics and consider how I can tweak them for my needs. The idea is to work one 16K page at a time. When the page is full, do a garbage collection cycle on it, so that we don't leave too many unused objects behind. Then, move on to the next 16K page and start over. Hopefully, we can then have one "in use" page with the recently allocated objects, some of which will have a short life and be garbage collected as soon as the page is full. And we have only a limited number of accesses to objects in other pages.

But, how do we implement that? Since we don't want to go through all of the memory, we can take some inspiration from more complex garbage collectors.

The algorithm can be used to do a complete garbage collection of the whole system. This is done by first making all objects "white" (this is easy: just create every new object with the marking bit set to the "white" value). Then, start from the "root" (the references to objects on the program stack), and mark all referenced objects as grey. Then, one by one, take a grey object, add all its referenced objects to the grey list, and so on, until there are no more grey objects.

When you are done, all reachable objects are now "black". Then, the memory can be accessed directly (page by page, not using object references), and the available space can be reused for allocating new objects. If you are used to C-style memory allocators, you would then start adding all this free space to some kind of free list. But that's a bad idea: you are using memory to keep track of your free memory. With a garbage collected system, what you can do instead is move objects around so that the remaining "black" objects are all tightly packed at the start of memory. Then, allocation simply consists of appending new objects at the end. Since each object has a field indicating its size, you can use that to figure out where the next object in memory is, and allocating is just incrementing the "last object" pointer by the size of the new object.
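
A sketch of that compaction pass, reusing the "16-bit size word with the high bit as the mark" idea from earlier (still hypothetical; real code would also have to update the references to the moved objects, which is what the next paragraph is about):

    #include <stdint.h>
    #include <string.h>

    #define MARK_BIT 0x8000u

    /* Walk the heap object by object (each object starts with its 16-bit
     * size, high bit used as the mark), slide the live objects down toward
     * the start, and return the new allocation pointer. */
    uint8_t *compact(uint8_t *heap_start, uint8_t *heap_end)
    {
        uint8_t *scan = heap_start;   /* next object to look at */
        uint8_t *dest = heap_start;   /* where the next survivor goes */

        while (scan < heap_end) {
            uint16_t word = *(uint16_t *)scan;
            uint16_t size = word & ~MARK_BIT;
            if (word & MARK_BIT) {                /* live (black) object */
                *(uint16_t *)scan = size;         /* clear the mark for the next cycle */
                memmove(dest, scan, size);        /* slide it down */
                dest += size;
            }
            scan += size;
        }
        return dest;   /* allocation continues from here */
    }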

Of course, when moving objects around, you must update references to them. In the two-space method, this is done by putting a special marker at the old location of the object, indicating its new address. As other objects are moved to the new space, they can use this information to update their references. Once everything is moved to the new space, the old space is empty and contains only unneeded redirections. But again, doing it this way would require going through all the memory.
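
Here is roughly what the forwarding marker looks like in the two-space scheme (a sketch with a made-up object layout; the actual Little Smalltalk code differs in the details):

    #include <stdint.h>
    #include <string.h>

    struct obj {
        uint16_t size;           /* size in bytes, header included */
        struct obj *forward;     /* new address, once the object has been copied */
        /* ... the object's fields follow ... */
    };

    static uint8_t *to_space_ptr;   /* bump pointer in the destination space */

    /* Copy an object to the new space, or follow the marker if it has
     * already been copied. Callers replace their reference with the
     * returned address. */
    struct obj *evacuate(struct obj *o)
    {
        if (o->forward != NULL)
            return o->forward;                  /* already moved: follow the marker */
        struct obj *copy = (struct obj *)to_space_ptr;
        memcpy(copy, o, o->size);               /* move the object */
        to_space_ptr += o->size;
        copy->forward = NULL;
        o->forward = copy;                      /* leave the redirection behind */
        return copy;
    }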

The tricolor scheme is better than that. One interesting thing is, if you read the above carefully, at some point I said "first, make all objects white". But this is a stronger requirement than what we really need. We have one condition that we must satisfy: black objects must not point to white objects directly, only to other grey or black ones. What if, initially, all our objects are black instead of white? This also satisfies the condition, and the algorithm can run from there.

This opens some new possibilities. It means we don't have to perform garbage collection on all the RAM every time. If we manage to build a small set of white objects, it can be scanned quite fast and without having to go into all the memory banks. For example, let's start by making all objects initially black. Then, we create a new object (in the currently active bank) and make it white. Surely we will store a reference to this object somewhere soon, and we will store it in another object. Since all our other objects are black, we are breaking our invariant here (we would have a black object pointing to a white one). So, we have two solutions: make the new object grey, or make the old one that points to it grey. Both of these preserve the invariant. Making the new object grey is not very interesting, however: quickly, all our new objects would become grey, be considered by the GC as still in use, and never get deleted. So we must take the other option.
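
This "grey the older object" rule is essentially what the literature calls a write barrier: a small piece of code run by the interpreter on every reference store. A sketch of it (hypothetical names):

    enum color { WHITE, GREY, BLACK };

    struct object {
        enum color color;
        /* ... fields ... */
    };

    extern void remember_grey(struct object *o);   /* push onto the grey list */

    /* Called for every "store a reference into an object" operation in the
     * interpreter. */
    void store_reference(struct object *holder, struct object **slot,
                         struct object *value)
    {
        *slot = value;
        if (holder->color == BLACK && value->color == WHITE) {
            holder->color = GREY;      /* the older object must be re-scanned */
            remember_grey(holder);
        }
    }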

When we decide to run our garbage collection (for example, because the working memory bank is full), we know that it can only delete objects from the current bank (the only ones that can be white at this time). The "grey" list of references was built as we modified objects in other banks. We can check these objects, but we can optimize that check: since all banks other than the current one contain only grey and black objects, we only need to check references into the current bank. All other references would either point to a black object (that has already been scanned), or a grey object (that will be scanned anyway). We will not learn anything by checking these. And so we can quickly scan the current bank, find which objects in it are still referenced, then compact it and start over. We can do this until there is only one free 16K bank left. Then, we need to do a complete GC pass on all the RAM, which will take longer, but it will clean up all remaining objects that became unreferenced after they were pushed out of the current bank.
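
A sketch of that optimized check, assuming references carry a bank number plus a 16-bit address (all names hypothetical):

    #include <stdint.h>

    struct ref {
        uint8_t  bank;       /* which 16K bank the object lives in */
        uint16_t offset;     /* address of the object inside that bank */
    };

    static uint8_t current_bank;                   /* the bank being collected */

    extern void mark_grey_local(uint16_t offset);  /* grey an object in the current bank */

    /* Called for every reference found while scanning a grey object. */
    void scan_reference(struct ref r)
    {
        if (r.bank == current_bank)
            mark_grey_local(r.offset);   /* only here can white objects exist */
        /* references into other banks point to grey or black objects:
         * nothing to learn from them, so they are skipped */
    }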

Note an important thing here: the current bank always ends up being full, then garbage collected and the remaining objects moved elsewhere. There is also a lot of back and forth between this current bank and all the others during the GC phase (but also probably while running the Smalltalk interpreter). So, I suggest the following layout:

  • 0000-3FFF: The "current bank" with the newly allocated objects (sharing space with other things if needed: interrupt handlers, stack, global variables for the interpreter)
  • 4000-7FFF: In main RAM, the screen memory. Overlaid with memory banks being accessed when the VM is running. Just map the bank out when printing something to the screen.
  • 8000-FFFF: Interpreter and GC code, with the AMSDOS variables in the middle.

So we maintain two pointers for memory allocations: one in the 0000-3FFF area for freshly created objects, and one for the 4000-7FFF area in banks, for the longer-lived objects. Once the 0000-3FFF RAM is full of objects, we do a small GC cycle of it to take care of the recently created and short-lived objects, then we move the remaining things to banks, and we start over. Until there are no banks left. Then, one big GC cycle to compact all the banks (if we ever manage to fill that much RAM, that is). We may even save ourselves the trouble of the "big GC": instead of compacting the RAM in place, if you run out of RAM, just save your session to disk. This is like a GC, but still-referenced objects are written to disk instead of another RAM space. Then reload it. That's it, you just defragmented your whole memory!
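
A sketch of those two allocation pointers and the "small GC when the new-objects area is full" trigger (all names hypothetical; the addresses follow the layout above, so this is really meant for a Z80 C compiler targeting the CPC rather than a hosted system, and the exact start address inside 0000-3FFF is arbitrary here):

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t *new_ptr    = (uint8_t *)0x0200;  /* fresh objects, 0000-3FFF area */
    static uint8_t *new_limit  = (uint8_t *)0x4000;
    static uint8_t *bank_ptr   = (uint8_t *)0x4000;  /* promoted objects, inside the mapped bank */
    static uint8_t *bank_limit = (uint8_t *)0x8000;

    extern void small_gc(void);   /* collect 0000-3FFF, move the survivors to the bank
                                     (advancing bank_ptr, mapping a new bank when the
                                     current one fills up), then reset new_ptr */

    void *allocate_object(size_t size)
    {
        if (new_ptr + size > new_limit) {
            small_gc();
            if (new_ptr + size > new_limit)
                return NULL;      /* object too big for the new-objects area */
        }
        void *obj = new_ptr;
        new_ptr += size;
        return obj;
    }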

One problem with this system remains: Little Smalltalk normally represents objects using 16-bit "cells". A cell can be either a pointer to another object, the object size, or a direct integer value. The VM uses the low bit to distinguish integers from pointers: pointers always point to an even address, while integers are shifted left by 1 and have the low bit set (we could change this and use the highest bit instead, since all our pointers to Smalltalk objects would be in 0000-7FFF; a small sketch of this encoding follows the memory map below). But we have to store information in every pointer/reference to know which bank it is pointing to. This would mean switching to 24-bit cells, which makes everything more memory hungry by 50%. Or, we could keep the internal pointers as 16-bit values, and have an indirection table somewhere to find the right destination. For example, let's say the interpreter code is now moved to a ROM, so we can do this:

  • 0000-3FFF: The "current bank" with the newly allocated objects (sharing space with other things if needed: interrupt handlers, stack, global variables for the interpreter)
  • 4000-7FFF: In main RAM, the indirection table. Overlaid with memory banks being accessed when the VM is running. Just map the bank out when printing something to the screen.
  • 8000-BFFF: AMSDOS and firmware stuff, variables for the interpreter if needed.
  • C000-FFFF: In main RAM, screen memory (write-only). In ROM, the Smalltalk interpreter code.
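
And here is the small sketch of the 16-bit cell tagging mentioned above (this follows the low-bit convention as I understand it from the Little Smalltalk sources; the helper names are mine):

    #include <stdint.h>

    typedef uint16_t cell;

    /* Pointers are always even, so the low bit is free to mark integers. */
    static int     cell_is_int(cell c)    { return c & 1; }
    static int     cell_is_ptr(cell c)    { return !(c & 1); }
    static cell    int_to_cell(int16_t v) { return ((uint16_t)v << 1) | 1; }
    static int16_t cell_to_int(cell c)    { return (int16_t)c >> 1; }  /* arithmetic shift assumed */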

With the layout above, we have a 16KB indirection table. If each reference needs 3 bytes, that's enough for 5461 objects. Maybe the table can go up to 9FFF, and then we have enough for 8192 objects. That would be enough to fill the 2MB memory if each object is exactly 256 bytes. Is this a reasonable compromise? I think objects will typically be smaller than that. So maybe using 24-bit cells is better. But then, the indirection table also simplifies things when moving objects around, since you can use the table to redirect to the new address of an object when it is moved. Otherwise, you have to update all objects which point to it. Not necessarily a big deal: this can be done at the same time as scanning the grey objects. Look at a grey object, see that it references something white in the "new allocations" area, immediately move the white object (if not already moved) to more persistent storage in extended memory, and update the grey object with the new address.
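
A sketch of how the indirection table could work (all names hypothetical; the entries are assumed to be packed to 3 bytes, which a Z80 C compiler will do but a desktop compiler may pad):

    #include <stdint.h>

    struct table_entry {
        uint8_t  bank;      /* which 16K bank the object currently lives in */
        uint16_t address;   /* its address once that bank is mapped at 4000 */
    };

    #define MAX_OBJECTS 5461                       /* 16KB table / 3 bytes per entry */
    static struct table_entry object_table[MAX_OBJECTS];

    extern void map_bank(uint8_t bank);            /* switch the 4000-7FFF window */

    /* Object references stay 16 bits: an index into the table. Moving an
     * object only requires updating its entry, not every referencing object. */
    void *resolve(uint16_t ref)
    {
        struct table_entry *e = &object_table[ref];
        map_bank(e->bank);                         /* bring the object into view */
        return (void *)(uintptr_t)e->address;
    }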

Well, all of this easier said than done. I guess I have to write an implementation now?
